<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>ArXiv CS.CV Papers (Image/Video Generation) - May 06, 2025</title>
    <script src="https://cdn.tailwindcss.com"></script>
    <script src="https://cdnjs.cloudflare.com/ajax/libs/framer-motion/10.16.4/framer-motion.dev.js"></script>
    <!-- Example using Font Awesome (replace with your preferred icon library if needed) -->
    <link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/6.4.2/css/all.min.css">
    <style>
        @import url('https://fonts.googleapis.com/css2?family=Inter:wght@400;600;700&display=swap');

        :root {
            /* New Palette: Light, Clean, Futuristic with Teal/Aqua accents */
            --bg-color: #f8fafc; /* Tailwind slate-50 (Very Light Gray) */
            --card-bg-color: #ffffff; /* White */
            --text-color: #1e293b; /* Tailwind slate-800 (Dark Gray-Blue) */
            --text-muted-color: #64748b; /* Tailwind slate-500 (Medium Gray-Blue) */
            --header-color: #0f172a; /* Tailwind slate-900 (Very Dark Blue) */
            --highlight-primary: #14b8a6; /* Tailwind teal-500 */
            --highlight-secondary: #67e8f9; /* Tailwind cyan-300 */
            --border-color: #e2e8f0; /* Tailwind slate-200 (Light Gray) */
            --shadow-color: rgba(15, 23, 42, 0.08); /* Subtle shadow based on slate-900 */
        }

        body {
            background-color: var(--bg-color);
            color: var(--text-color);
            font-family: 'Inter', sans-serif;
            overflow-x: hidden; /* Prevent horizontal scroll */
            line-height: 1.6;
        }

        .bento-grid {
            display: grid;
            gap: 1.5rem; /* Tailwind gap-6 */
            grid-template-columns: 1fr; /* Force single column */
            padding-bottom: 4rem; /* Add padding at the bottom */
        }

        .bento-item {
            /* Apply semi-transparent white background and blur */
            background-color: rgba(255, 255, 255, 0.7); /* White with 70% opacity */
            backdrop-filter: blur(10px); /* Apply blur effect */
            -webkit-backdrop-filter: blur(10px); /* Safari prefix */
            border-radius: 1rem; /* Slightly larger radius */
            padding: 1.75rem; /* Slightly more padding */
            border: 1px solid rgba(226, 232, 240, 0.5); /* Lighter border with transparency */
            box-shadow: 0 4px 12px var(--shadow-color);
            transition: transform 0.3s ease-out, box-shadow 0.3s ease-out, background-color 0.3s ease-out;
            overflow: hidden; /* Ensure content doesn't overflow */
            position: relative; /* For potential pseudo-elements */
        }

        /* Removed ::before pseudo-element for a cleaner look */


        .bento-item:hover {
            transform: translateY(-6px);
            box-shadow: 0 10px 20px var(--shadow-color), 0 4px 8px rgba(15, 23, 42, 0.06); /* Adjusted hover shadow */
        }

        .paper-title {
            font-size: 1.125rem; /* Tailwind text-lg */
            font-weight: 600; /* Tailwind font-semibold */
            color: var(--highlight-primary); /* Use new primary highlight */
            margin-bottom: 0.75rem; /* Tailwind mb-3 */
            line-height: 1.4;
        }

        .paper-summary {
            font-size: 0.875rem; /* Tailwind text-sm */
            color: var(--text-muted-color);
            margin-bottom: 1.25rem; /* Tailwind mb-5 */
            line-height: 1.6;
        }

        .paper-link {
            display: inline-flex; /* Use flex for icon alignment */
            align-items: center;
            font-size: 0.875rem; /* Tailwind text-sm */
            font-weight: 600;
            color: var(--highlight-primary);
            text-decoration: none;
            padding: 0.5rem 1rem; /* Add padding */
            border-radius: 0.5rem; /* Slightly rounder */
            background-color: rgba(20, 184, 166, 0.08); /* Subtle teal background */
            border: 1px solid rgba(20, 184, 166, 0.2);
            transition: background-color 0.3s ease, color 0.3s ease, transform 0.2s ease;
        }

        .paper-link i {
            margin-right: 0.5rem; /* Tailwind mr-2 */
            transition: transform 0.3s ease;
        }

        .paper-link:hover {
            background-color: rgba(20, 184, 166, 0.15);
            color: #0d9488; /* Darker teal on hover */
            transform: translateY(-1px);
        }
        .paper-link:hover i {
             transform: translateX(2px);
        }

        .paper-authors {
            font-size: 0.75rem; /* Tailwind text-xs */
            color: var(--text-muted-color);
            margin-top: 1rem; /* Tailwind mt-4 */
            font-style: italic;
        }

        .header {
            text-align: center;
            margin-bottom: 3rem; /* Tailwind mb-12 */
            padding-top: 3rem; /* Tailwind pt-12 */
        }

        .header h1 {
            font-size: 2.5rem; /* Tailwind text-4xl or 5xl */
            font-weight: 700; /* Tailwind font-bold */
            color: var(--header-color);
            letter-spacing: -0.025em; /* Tailwind tracking-tight */
            margin-bottom: 0.5rem;
            /* Optional: Add a subtle text gradient */
            /* background: linear-gradient(90deg, var(--highlight-primary), var(--highlight-secondary)); */
            /* -webkit-background-clip: text; */
            /* -webkit-text-fill-color: transparent; */
        }

        .header p {
            font-size: 1.125rem; /* Tailwind text-lg */
            color: var(--text-muted-color);
            margin-top: 0.5rem; /* Tailwind mt-2 */
            max-width: 600px;
            margin-left: auto;
            margin-right: auto;
        }

        .footer {
            text-align: center;
            color: var(--text-muted-color);
            font-size: 0.875rem; /* Tailwind text-sm */
            padding-top: 2rem;
            padding-bottom: 2rem; /* Tailwind py-8 */
            border-top: 1px solid var(--border-color);
            margin-top: 4rem;
        }

        /* Simple line graphic element (optional) */
        .line-graphic {
            height: 1px; /* Thinner line */
            background: linear-gradient(90deg, rgba(20, 184, 166, 0), var(--highlight-primary), rgba(20, 184, 166, 0));
            opacity: 0.6;
            margin: 1.5rem 0; /* Adjust margin */
        }

        /* Framer Motion requires the script, styles enhance appearance */
        [data-motion-element] {
             /* Base styles for elements animated by Framer Motion */
        }

        .paper-tldr {
            font-size: 0.95rem; /* Slightly bigger than summary */
            color: #475569; /* Changed to Tailwind slate-600 (slightly darker than summary) */
            margin-top: 0.75rem; /* Tailwind mt-3 */
            margin-bottom: 0.75rem; /* Tailwind mb-2 */
            /* font-style: italic; */
            font-weight: bold;
        }

        .paper-rating {
            margin-top: 1rem; /* Tailwind mt-4 */
            margin-bottom: 1rem; /* Tailwind mb-4 */
            color: #f59e0b; /* Tailwind amber-500 */
        }

        .paper-rating i {
            margin-right: 0.125rem; /* Tailwind mr-0.5 */
        }

        /* Apply consistent star color to sub-ratings */
        .paper-sub-ratings .rating-item i {
            color: #f59e0b; /* Match overall rating star color (amber-500) */
            margin-right: 0.125rem; /* Consistent spacing */
        }

    </style>
</head>
<body class="container mx-auto px-4 antialiased">

    <motion.div
        initial="{ opacity: 0, y: -30 }"
        animate="{ opacity: 1, y: 0 }"
        transition="{ duration: 0.6, ease: 'easeOut' }"
        class="header"
        data-motion-element
    >
        <h1>AIGC Daily Papers</h1>
        <p>Daily papers related to Image/Video/Multimodal Generation from cs.CV</p>
        <p>May 06, 2025</p>
        <div class="line-graphic mt-4 mb-8 mx-auto w-1/4"></div> <!-- Added line graphic -->
    </motion.div>

    <div class="bento-grid" id="paper-grid">
        
        <motion.div
            initial="{ opacity: 0, y: 50, scale: 0.9 }"
            whileInView="{ opacity: 1, y: 0, scale: 1 }"
            viewport="{ once: true, amount: 0.2 }" /* Trigger when 20% is visible */
            transition="{ duration: 0.5, delay: 0.0, ease: 'easeOut' }"  
            class="bento-item"
            data-motion-element
        >
            <h2 class="paper-title">Scenethesis: A Language and Vision Agentic Framework for 3D Scene Generation</h2>
            <p class="paper-summary">Synthesizing interactive 3D scenes from text is essential for gaming, virtual
reality, and embodied AI. However, existing methods face several challenges.
Learning-based approaches depend on small-scale indoor datasets, limiting the
scene diversity and layout complexity. While large language models (LLMs) can
leverage diverse text-domain knowledge, they struggle with spatial realism,
often producing unnatural object placements that fail to respect common sense.
Our key insight is that vision perception can bridge this gap by providing
realistic spatial guidance that LLMs lack. To this end, we introduce
Scenethesis, a training-free agentic framework that integrates LLM-based scene
planning with vision-guided layout refinement. Given a text prompt, Scenethesis
first employs an LLM to draft a coarse layout. A vision module then refines it
by generating an image guidance and extracting scene structure to capture
inter-object relations. Next, an optimization module iteratively enforces
accurate pose alignment and physical plausibility, preventing artifacts like
object penetration and instability. Finally, a judge module verifies spatial
coherence. Comprehensive experiments show that Scenethesis generates diverse,
realistic, and physically plausible 3D interactive scenes, making it valuable
for virtual content creation, simulation environments, and embodied AI
research.</p>
            
            <p class="paper-tldr"><strong>TLDR</strong>: scenethesis is a training-free agentic framework that leverages llms and vision perception to generate realistic and physically plausible 3d scenes from text prompts, addressing limitations of existing methods.</p>
            
            
            <p class="paper-tldr"><strong>TLDR</strong>: scenethesis是一个无需训练的agentic框架，它利用llm和视觉感知来从文本提示生成逼真且物理上合理的3d场景，解决了现有方法的局限性。</p>
            

            
            
            <div class="paper-sub-ratings" style="display: flex; flex-wrap: wrap; gap: 10px; margin-bottom: 5px; font-size: 0.8em;">
                
                <div class="rating-item">
                    <span class="rating-label">Relevance:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="far fa-star"></i>
                    <span class="text-xs text-gray-500 ml-1">(8/10)</span>
                </div>
                
                
                <div class="rating-item">
                    <span class="rating-label">Novelty:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star-half-alt"></i>
                    <span class="text-xs text-gray-500 ml-1">(9/10)</span>
                </div>
                
                
                <div class="rating-item">
                    <span class="rating-label">Clarity:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star-half-alt"></i>
                    <span class="text-xs text-gray-500 ml-1">(9/10)</span>
                </div>
                
                
                <div class="rating-item">
                    <span class="rating-label">Potential Impact:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="far fa-star"></i>
                    <span class="text-xs text-gray-500 ml-1">(8/10)</span>
                </div>
                
            </div>
            
            

            
            <div class="paper-rating">
                <span class="rating-label" style="color: #000; font-weight: bold;">Overall:</span>
                
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="far fa-star"></i>
                    
                
                <span class="text-xs text-gray-500 ml-1">(8/10)</span>
            </div>
            

            <a href="http://arxiv.org/abs/2505.02836v1" target="_blank" class="paper-link">
                <i class="fas fa-file-pdf mr-1"></i> Read Paper (PDF)
            </a>
            
            <p class="paper-authors">Authors: Lu Ling, Chen-Hsuan Lin, Tsung-Yi Lin, Yifan Ding, Yu Zeng, Yichen Sheng, Yunhao Ge, Ming-Yu Liu, Aniket Bera, Zhaoshuo Li</p>
            
        </motion.div>
        
        <motion.div
            initial="{ opacity: 0, y: 50, scale: 0.9 }"
            whileInView="{ opacity: 1, y: 0, scale: 1 }"
            viewport="{ once: true, amount: 0.2 }" /* Trigger when 20% is visible */
            transition="{ duration: 0.5, delay: 0.05, ease: 'easeOut' }"  
            class="bento-item"
            data-motion-element
        >
            <h2 class="paper-title">No Other Representation Component Is Needed: Diffusion Transformers Can Provide Representation Guidance by Themselves</h2>
            <p class="paper-summary">Recent studies have demonstrated that learning a meaningful internal
representation can both accelerate generative training and enhance generation
quality of the diffusion transformers. However, existing approaches necessitate
to either introduce an additional and complex representation training framework
or rely on a large-scale, pre-trained representation foundation model to
provide representation guidance during the original generative training
process. In this study, we posit that the unique discriminative process
inherent to diffusion transformers enables them to offer such guidance without
requiring external representation components. We therefore propose
Self-Representation A}lignment (SRA), a simple yet straightforward method that
obtain representation guidance through a self-distillation manner.
Specifically, SRA aligns the output latent representation of the diffusion
transformer in earlier layer with higher noise to that in later layer with
lower noise to progressively enhance the overall representation learning during
only generative training process. Experimental results indicate that applying
SRA to DiTs and SiTs yields consistent performance improvements. Moreover, SRA
not only significantly outperforms approaches relying on auxiliary, complex
representation training frameworks but also achieves performance comparable to
methods that heavily dependent on powerful external representation priors.</p>
            
            <p class="paper-tldr"><strong>TLDR</strong>: this paper proposes self-representation alignment (sra), a method to improve diffusion transformer performance by using the model's inherent discriminative properties for representation guidance, avoiding the need for external representation components or pre-trained models.</p>
            
            
            <p class="paper-tldr"><strong>TLDR</strong>: 该论文提出自表示对齐（sra）方法，通过利用扩散转换器固有的判别特性来进行表示引导，从而提高扩散转换器的性能，避免了对外部表示组件或预训练模型的需求。</p>
            

            
            
            <div class="paper-sub-ratings" style="display: flex; flex-wrap: wrap; gap: 10px; margin-bottom: 5px; font-size: 0.8em;">
                
                <div class="rating-item">
                    <span class="rating-label">Relevance:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="far fa-star"></i>
                    <span class="text-xs text-gray-500 ml-1">(8/10)</span>
                </div>
                
                
                <div class="rating-item">
                    <span class="rating-label">Novelty:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star-half-alt"></i><i class="far fa-star"></i>
                    <span class="text-xs text-gray-500 ml-1">(7/10)</span>
                </div>
                
                
                <div class="rating-item">
                    <span class="rating-label">Clarity:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star-half-alt"></i>
                    <span class="text-xs text-gray-500 ml-1">(9/10)</span>
                </div>
                
                
                <div class="rating-item">
                    <span class="rating-label">Potential Impact:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star-half-alt"></i><i class="far fa-star"></i>
                    <span class="text-xs text-gray-500 ml-1">(7/10)</span>
                </div>
                
            </div>
            
            

            
            <div class="paper-rating">
                <span class="rating-label" style="color: #000; font-weight: bold;">Overall:</span>
                
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="far fa-star"></i>
                    
                
                <span class="text-xs text-gray-500 ml-1">(8/10)</span>
            </div>
            

            <a href="http://arxiv.org/abs/2505.02831v1" target="_blank" class="paper-link">
                <i class="fas fa-file-pdf mr-1"></i> Read Paper (PDF)
            </a>
            
            <p class="paper-authors">Authors: Dengyang Jiang, Mengmeng Wang, Liuzhuozheng Li, Lei Zhang, Haoyu Wang, Wei Wei, Guang Dai, Yanning Zhang, Jingdong Wang</p>
            
        </motion.div>
        
        <motion.div
            initial="{ opacity: 0, y: 50, scale: 0.9 }"
            whileInView="{ opacity: 1, y: 0, scale: 1 }"
            viewport="{ once: true, amount: 0.2 }" /* Trigger when 20% is visible */
            transition="{ duration: 0.5, delay: 0.1, ease: 'easeOut' }"  
            class="bento-item"
            data-motion-element
        >
            <h2 class="paper-title">Towards Dataset Copyright Evasion Attack against Personalized Text-to-Image Diffusion Models</h2>
            <p class="paper-summary">Text-to-image (T2I) diffusion models have rapidly advanced, enabling
high-quality image generation conditioned on textual prompts. However, the
growing trend of fine-tuning pre-trained models for personalization raises
serious concerns about unauthorized dataset usage. To combat this, dataset
ownership verification (DOV) has emerged as a solution, embedding watermarks
into the fine-tuning datasets using backdoor techniques. These watermarks
remain inactive under benign samples but produce owner-specified outputs when
triggered. Despite the promise of DOV for T2I diffusion models, its robustness
against copyright evasion attacks (CEA) remains unexplored. In this paper, we
explore how attackers can bypass these mechanisms through CEA, allowing models
to circumvent watermarks even when trained on watermarked datasets. We propose
the first copyright evasion attack (i.e., CEAT2I) specifically designed to
undermine DOV in T2I diffusion models. Concretely, our CEAT2I comprises three
stages: watermarked sample detection, trigger identification, and efficient
watermark mitigation. A key insight driving our approach is that T2I models
exhibit faster convergence on watermarked samples during the fine-tuning,
evident through intermediate feature deviation. Leveraging this, CEAT2I can
reliably detect the watermarked samples. Then, we iteratively ablate tokens
from the prompts of detected watermarked samples and monitor shifts in
intermediate features to pinpoint the exact trigger tokens. Finally, we adopt a
closed-form concept erasure method to remove the injected watermark. Extensive
experiments show that our CEAT2I effectively evades DOV mechanisms while
preserving model performance.</p>
            
            <p class="paper-tldr"><strong>TLDR</strong>: this paper introduces a novel copyright evasion attack (ceat2i) against dataset ownership verification (dov) mechanisms in personalized text-to-image diffusion models, demonstrating its effectiveness in bypassing watermarks.</p>
            
            
            <p class="paper-tldr"><strong>TLDR</strong>: 本文提出了一种针对个性化文本到图像扩散模型中数据集所有权验证（dov）机制的新型版权规避攻击（ceat2i），并证明其在绕过水印方面的有效性。</p>
            

            
            
            <div class="paper-sub-ratings" style="display: flex; flex-wrap: wrap; gap: 10px; margin-bottom: 5px; font-size: 0.8em;">
                
                <div class="rating-item">
                    <span class="rating-label">Relevance:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star-half-alt"></i><i class="far fa-star"></i>
                    <span class="text-xs text-gray-500 ml-1">(7/10)</span>
                </div>
                
                
                <div class="rating-item">
                    <span class="rating-label">Novelty:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star-half-alt"></i>
                    <span class="text-xs text-gray-500 ml-1">(9/10)</span>
                </div>
                
                
                <div class="rating-item">
                    <span class="rating-label">Clarity:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="far fa-star"></i>
                    <span class="text-xs text-gray-500 ml-1">(8/10)</span>
                </div>
                
                
                <div class="rating-item">
                    <span class="rating-label">Potential Impact:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="far fa-star"></i>
                    <span class="text-xs text-gray-500 ml-1">(8/10)</span>
                </div>
                
            </div>
            
            

            
            <div class="paper-rating">
                <span class="rating-label" style="color: #000; font-weight: bold;">Overall:</span>
                
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="far fa-star"></i>
                    
                
                <span class="text-xs text-gray-500 ml-1">(8/10)</span>
            </div>
            

            <a href="http://arxiv.org/abs/2505.02824v1" target="_blank" class="paper-link">
                <i class="fas fa-file-pdf mr-1"></i> Read Paper (PDF)
            </a>
            
            <p class="paper-authors">Authors: Kuofeng Gao, Yufei Zhu, Yiming Li, Jiawang Bai, Yong Yang, Zhifeng Li, Shu-Tao Xia</p>
            
        </motion.div>
        
        <motion.div
            initial="{ opacity: 0, y: 50, scale: 0.9 }"
            whileInView="{ opacity: 1, y: 0, scale: 1 }"
            viewport="{ once: true, amount: 0.2 }" /* Trigger when 20% is visible */
            transition="{ duration: 0.5, delay: 0.15000000000000002, ease: 'easeOut' }"  
            class="bento-item"
            data-motion-element
        >
            <h2 class="paper-title">MUSAR: Exploring Multi-Subject Customization from Single-Subject Dataset via Attention Routing</h2>
            <p class="paper-summary">Current multi-subject customization approaches encounter two critical
challenges: the difficulty in acquiring diverse multi-subject training data,
and attribute entanglement across different subjects. To bridge these gaps, we
propose MUSAR - a simple yet effective framework to achieve robust
multi-subject customization while requiring only single-subject training data.
Firstly, to break the data limitation, we introduce debiased diptych learning.
It constructs diptych training pairs from single-subject images to facilitate
multi-subject learning, while actively correcting the distribution bias
introduced by diptych construction via static attention routing and dual-branch
LoRA. Secondly, to eliminate cross-subject entanglement, we introduce dynamic
attention routing mechanism, which adaptively establishes bijective mappings
between generated images and conditional subjects. This design not only
achieves decoupling of multi-subject representations but also maintains
scalable generalization performance with increasing reference subjects.
Comprehensive experiments demonstrate that our MUSAR outperforms existing
methods - even those trained on multi-subject dataset - in image quality,
subject consistency, and interaction naturalness, despite requiring only
single-subject dataset.</p>
            
            <p class="paper-tldr"><strong>TLDR</strong>: the paper introduces musar, a framework for multi-subject image customization using only single-subject training data, addressing data scarcity and subject entanglement challenges through debiased diptych learning and dynamic attention routing; it claims superior performance over methods trained on multi-subject data.</p>
            
            
            <p class="paper-tldr"><strong>TLDR</strong>: 该论文介绍了musar，一个仅使用单主体训练数据进行多主体图像定制的框架。它通过去偏二联画学习和动态注意力路由来解决数据稀缺和主体纠缠的挑战，并声称优于在多主体数据上训练的方法。</p>
            

            
            
            <div class="paper-sub-ratings" style="display: flex; flex-wrap: wrap; gap: 10px; margin-bottom: 5px; font-size: 0.8em;">
                
                <div class="rating-item">
                    <span class="rating-label">Relevance:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="far fa-star"></i>
                    <span class="text-xs text-gray-500 ml-1">(8/10)</span>
                </div>
                
                
                <div class="rating-item">
                    <span class="rating-label">Novelty:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star-half-alt"></i>
                    <span class="text-xs text-gray-500 ml-1">(9/10)</span>
                </div>
                
                
                <div class="rating-item">
                    <span class="rating-label">Clarity:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="far fa-star"></i>
                    <span class="text-xs text-gray-500 ml-1">(8/10)</span>
                </div>
                
                
                <div class="rating-item">
                    <span class="rating-label">Potential Impact:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="far fa-star"></i>
                    <span class="text-xs text-gray-500 ml-1">(8/10)</span>
                </div>
                
            </div>
            
            

            
            <div class="paper-rating">
                <span class="rating-label" style="color: #000; font-weight: bold;">Overall:</span>
                
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="far fa-star"></i>
                    
                
                <span class="text-xs text-gray-500 ml-1">(8/10)</span>
            </div>
            

            <a href="http://arxiv.org/abs/2505.02823v1" target="_blank" class="paper-link">
                <i class="fas fa-file-pdf mr-1"></i> Read Paper (PDF)
            </a>
            
            <p class="paper-authors">Authors: Zinan Guo, Pengze Zhang, Yanze Wu, Chong Mou, Songtao Zhao, Qian He</p>
            
        </motion.div>
        
        <motion.div
            initial="{ opacity: 0, y: 50, scale: 0.9 }"
            whileInView="{ opacity: 1, y: 0, scale: 1 }"
            viewport="{ once: true, amount: 0.2 }" /* Trigger when 20% is visible */
            transition="{ duration: 0.5, delay: 0.2, ease: 'easeOut' }"  
            class="bento-item"
            data-motion-element
        >
            <h2 class="paper-title">MCCD: Multi-Agent Collaboration-based Compositional Diffusion for Complex Text-to-Image Generation</h2>
            <p class="paper-summary">Diffusion models have shown excellent performance in text-to-image
generation. Nevertheless, existing methods often suffer from performance
bottlenecks when handling complex prompts that involve multiple objects,
characteristics, and relations. Therefore, we propose a Multi-agent
Collaboration-based Compositional Diffusion (MCCD) for text-to-image generation
for complex scenes. Specifically, we design a multi-agent collaboration-based
scene parsing module that generates an agent system comprising multiple agents
with distinct tasks, utilizing MLLMs to extract various scene elements
effectively. In addition, Hierarchical Compositional diffusion utilizes a
Gaussian mask and filtering to refine bounding box regions and enhance objects
through region enhancement, resulting in the accurate and high-fidelity
generation of complex scenes. Comprehensive experiments demonstrate that our
MCCD significantly improves the performance of the baseline models in a
training-free manner, providing a substantial advantage in complex scene
generation.</p>
            
            <p class="paper-tldr"><strong>TLDR</strong>: mccd introduces a multi-agent collaboration and hierarchical compositional diffusion approach to improve text-to-image generation for complex prompts by effectively parsing scenes and enhancing object regions.</p>
            
            
            <p class="paper-tldr"><strong>TLDR</strong>: mccd 提出了一种基于多智能体协作和分层组合扩散的方法，通过有效解析场景和强化对象区域，来改进复杂提示的文本到图像生成。</p>
            

            
            
            <div class="paper-sub-ratings" style="display: flex; flex-wrap: wrap; gap: 10px; margin-bottom: 5px; font-size: 0.8em;">
                
                <div class="rating-item">
                    <span class="rating-label">Relevance:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star-half-alt"></i>
                    <span class="text-xs text-gray-500 ml-1">(9/10)</span>
                </div>
                
                
                <div class="rating-item">
                    <span class="rating-label">Novelty:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="far fa-star"></i>
                    <span class="text-xs text-gray-500 ml-1">(8/10)</span>
                </div>
                
                
                <div class="rating-item">
                    <span class="rating-label">Clarity:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="far fa-star"></i>
                    <span class="text-xs text-gray-500 ml-1">(8/10)</span>
                </div>
                
                
                <div class="rating-item">
                    <span class="rating-label">Potential Impact:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="far fa-star"></i>
                    <span class="text-xs text-gray-500 ml-1">(8/10)</span>
                </div>
                
            </div>
            
            

            
            <div class="paper-rating">
                <span class="rating-label" style="color: #000; font-weight: bold;">Overall:</span>
                
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="far fa-star"></i>
                    
                
                <span class="text-xs text-gray-500 ml-1">(8/10)</span>
            </div>
            

            <a href="http://arxiv.org/abs/2505.02648v1" target="_blank" class="paper-link">
                <i class="fas fa-file-pdf mr-1"></i> Read Paper (PDF)
            </a>
            
            <p class="paper-authors">Authors: Mingcheng Li, Xiaolu Hou, Ziyang Liu, Dingkang Yang, Ziyun Qian, Jiawei Chen, Jinjie Wei, Yue Jiang, Qingyao Xu, Lihua Zhang</p>
            
        </motion.div>
        
        <motion.div
            initial="{ opacity: 0, y: 50, scale: 0.9 }"
            whileInView="{ opacity: 1, y: 0, scale: 1 }"
            viewport="{ once: true, amount: 0.2 }" /* Trigger when 20% is visible */
            transition="{ duration: 0.5, delay: 0.25, ease: 'easeOut' }"  
            class="bento-item"
            data-motion-element
        >
            <h2 class="paper-title">Unified Multimodal Understanding and Generation Models: Advances, Challenges, and Opportunities</h2>
            <p class="paper-summary">Recent years have seen remarkable progress in both multimodal understanding
models and image generation models. Despite their respective successes, these
two domains have evolved independently, leading to distinct architectural
paradigms: While autoregressive-based architectures have dominated multimodal
understanding, diffusion-based models have become the cornerstone of image
generation. Recently, there has been growing interest in developing unified
frameworks that integrate these tasks. The emergence of GPT-4o's new
capabilities exemplifies this trend, highlighting the potential for
unification. However, the architectural differences between the two domains
pose significant challenges. To provide a clear overview of current efforts
toward unification, we present a comprehensive survey aimed at guiding future
research. First, we introduce the foundational concepts and recent advancements
in multimodal understanding and text-to-image generation models. Next, we
review existing unified models, categorizing them into three main architectural
paradigms: diffusion-based, autoregressive-based, and hybrid approaches that
fuse autoregressive and diffusion mechanisms. For each category, we analyze the
structural designs and innovations introduced by related works. Additionally,
we compile datasets and benchmarks tailored for unified models, offering
resources for future exploration. Finally, we discuss the key challenges facing
this nascent field, including tokenization strategy, cross-modal attention, and
data. As this area is still in its early stages, we anticipate rapid
advancements and will regularly update this survey. Our goal is to inspire
further research and provide a valuable reference for the community. The
references associated with this survey will be available on GitHub soon.</p>
            
            <p class="paper-tldr"><strong>TLDR</strong>: this paper surveys the emerging field of unified multimodal understanding and generation models, categorizing existing approaches (diffusion, autoregressive, hybrid), highlighting key challenges, and providing resources for future research. it aims to guide research in this rapidly evolving area.</p>
            
            
            <p class="paper-tldr"><strong>TLDR</strong>: 本文综述了统一多模态理解和生成模型的新兴领域，对现有方法（扩散模型、自回归模型、混合模型）进行分类，强调了关键挑战，并为未来研究提供了资源。旨在指导这一快速发展领域的研究。</p>
            

            
            
            <div class="paper-sub-ratings" style="display: flex; flex-wrap: wrap; gap: 10px; margin-bottom: 5px; font-size: 0.8em;">
                
                <div class="rating-item">
                    <span class="rating-label">Relevance:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star-half-alt"></i>
                    <span class="text-xs text-gray-500 ml-1">(9/10)</span>
                </div>
                
                
                <div class="rating-item">
                    <span class="rating-label">Novelty:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star-half-alt"></i><i class="far fa-star"></i>
                    <span class="text-xs text-gray-500 ml-1">(7/10)</span>
                </div>
                
                
                <div class="rating-item">
                    <span class="rating-label">Clarity:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star-half-alt"></i>
                    <span class="text-xs text-gray-500 ml-1">(9/10)</span>
                </div>
                
                
                <div class="rating-item">
                    <span class="rating-label">Potential Impact:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="far fa-star"></i>
                    <span class="text-xs text-gray-500 ml-1">(8/10)</span>
                </div>
                
            </div>
            
            

            
            <div class="paper-rating">
                <span class="rating-label" style="color: #000; font-weight: bold;">Overall:</span>
                
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="far fa-star"></i>
                    
                
                <span class="text-xs text-gray-500 ml-1">(8/10)</span>
            </div>
            

            <a href="http://arxiv.org/abs/2505.02567v1" target="_blank" class="paper-link">
                <i class="fas fa-file-pdf mr-1"></i> Read Paper (PDF)
            </a>
            
            <p class="paper-authors">Authors: Xinjie Zhang, Jintao Guo, Shanshan Zhao, Minghao Fu, Lunhao Duan, Guo-Hua Wang, Qing-Guo Chen, Zhao Xu, Weihua Luo, Kaifu Zhang</p>
            
        </motion.div>
        
        <motion.div
            initial="{ opacity: 0, y: 50, scale: 0.9 }"
            whileInView="{ opacity: 1, y: 0, scale: 1 }"
            viewport="{ once: true, amount: 0.2 }" /* Trigger when 20% is visible */
            transition="{ duration: 0.5, delay: 0.30000000000000004, ease: 'easeOut' }"  
            class="bento-item"
            data-motion-element
        >
            <h2 class="paper-title">Text to Image Generation and Editing: A Survey</h2>
            <p class="paper-summary">Text-to-image generation (T2I) refers to the text-guided generation of
high-quality images. In the past few years, T2I has attracted widespread
attention and numerous works have emerged. In this survey, we comprehensively
review 141 works conducted from 2021 to 2024. First, we introduce four
foundation model architectures of T2I (autoregression, non-autoregression, GAN
and diffusion) and the commonly used key technologies (autoencoder, attention
and classifier-free guidance). Secondly, we systematically compare the methods
of these studies in two directions, T2I generation and T2I editing, including
the encoders and the key technologies they use. In addition, we also compare
the performance of these researches side by side in terms of datasets,
evaluation metrics, training resources, and inference speed. In addition to the
four foundation models, we survey other works on T2I, such as energy-based
models and recent Mamba and multimodality. We also investigate the potential
social impact of T2I and provide some solutions. Finally, we propose unique
insights of improving the performance of T2I models and possible future
development directions. In summary, this survey is the first systematic and
comprehensive overview of T2I, aiming to provide a valuable guide for future
researchers and stimulate continued progress in this field.</p>
            
            <p class="paper-tldr"><strong>TLDR</strong>: this survey paper comprehensively reviews text-to-image generation (t2i) methods from 2021-2024, covering architectures, key technologies, performance comparisons, and potential social impact, also providing insights and future directions.</p>
            
            
            <p class="paper-tldr"><strong>TLDR</strong>: 这篇综述全面回顾了2021-2024年的文本到图像生成(t2i)方法，涵盖架构、关键技术、性能比较和社会影响，并提供了见解和未来方向。</p>
            

            
            
            <div class="paper-sub-ratings" style="display: flex; flex-wrap: wrap; gap: 10px; margin-bottom: 5px; font-size: 0.8em;">
                
                <div class="rating-item">
                    <span class="rating-label">Relevance:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star-half-alt"></i>
                    <span class="text-xs text-gray-500 ml-1">(9/10)</span>
                </div>
                
                
                <div class="rating-item">
                    <span class="rating-label">Novelty:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star-half-alt"></i><i class="far fa-star"></i>
                    <span class="text-xs text-gray-500 ml-1">(7/10)</span>
                </div>
                
                
                <div class="rating-item">
                    <span class="rating-label">Clarity:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star-half-alt"></i>
                    <span class="text-xs text-gray-500 ml-1">(9/10)</span>
                </div>
                
                
                <div class="rating-item">
                    <span class="rating-label">Potential Impact:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="far fa-star"></i>
                    <span class="text-xs text-gray-500 ml-1">(8/10)</span>
                </div>
                
            </div>
            
            

            
            <div class="paper-rating">
                <span class="rating-label" style="color: #000; font-weight: bold;">Overall:</span>
                
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="far fa-star"></i>
                    
                
                <span class="text-xs text-gray-500 ml-1">(8/10)</span>
            </div>
            

            <a href="http://arxiv.org/abs/2505.02527v1" target="_blank" class="paper-link">
                <i class="fas fa-file-pdf mr-1"></i> Read Paper (PDF)
            </a>
            
            <p class="paper-authors">Authors: Pengfei Yang, Ngai-Man Cheung, Xinda Ma</p>
            
        </motion.div>
        
        <motion.div
            initial="{ opacity: 0, y: 50, scale: 0.9 }"
            whileInView="{ opacity: 1, y: 0, scale: 1 }"
            viewport="{ once: true, amount: 0.2 }" /* Trigger when 20% is visible */
            transition="{ duration: 0.5, delay: 0.35000000000000003, ease: 'easeOut' }"  
            class="bento-item"
            data-motion-element
        >
            <h2 class="paper-title">Ming-Lite-Uni: Advancements in Unified Architecture for Natural Multimodal Interaction</h2>
            <p class="paper-summary">We introduce Ming-Lite-Uni, an open-source multimodal framework featuring a
newly designed unified visual generator and a native multimodal autoregressive
model tailored for unifying vision and language. Specifically, this project
provides an open-source implementation of the integrated MetaQueries and
M2-omni framework, while introducing the novel multi-scale learnable tokens and
multi-scale representation alignment strategy. By leveraging a fixed MLLM and a
learnable diffusion model, Ming-Lite-Uni enables native multimodal AR models to
perform both text-to-image generation and instruction based image editing
tasks, expanding their capabilities beyond pure visual understanding. Our
experimental results demonstrate the strong performance of Ming-Lite-Uni and
illustrate the impressive fluid nature of its interactive process. All code and
model weights are open-sourced to foster further exploration within the
community. Notably, this work aligns with concurrent multimodal AI milestones -
such as ChatGPT-4o with native image generation updated in March 25, 2025 -
underscoring the broader significance of unified models like Ming-Lite-Uni on
the path toward AGI. Ming-Lite-Uni is in alpha stage and will soon be further
refined.</p>
            
            <p class="paper-tldr"><strong>TLDR</strong>: ming-lite-uni is a new open-source multimodal framework featuring a unified visual generator and a native multimodal autoregressive model for vision and language tasks, enabling text-to-image generation and instruction-based image editing.</p>
            
            
            <p class="paper-tldr"><strong>TLDR</strong>: ming-lite-uni 是一个新的开源多模态框架，具有统一的视觉生成器和原生多模态自回归模型，用于视觉和语言任务，支持文本到图像的生成和基于指令的图像编辑。</p>
            

            
            
            <div class="paper-sub-ratings" style="display: flex; flex-wrap: wrap; gap: 10px; margin-bottom: 5px; font-size: 0.8em;">
                
                <div class="rating-item">
                    <span class="rating-label">Relevance:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star-half-alt"></i>
                    <span class="text-xs text-gray-500 ml-1">(9/10)</span>
                </div>
                
                
                <div class="rating-item">
                    <span class="rating-label">Novelty:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star-half-alt"></i><i class="far fa-star"></i>
                    <span class="text-xs text-gray-500 ml-1">(7/10)</span>
                </div>
                
                
                <div class="rating-item">
                    <span class="rating-label">Clarity:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="far fa-star"></i>
                    <span class="text-xs text-gray-500 ml-1">(8/10)</span>
                </div>
                
                
                <div class="rating-item">
                    <span class="rating-label">Potential Impact:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="far fa-star"></i>
                    <span class="text-xs text-gray-500 ml-1">(8/10)</span>
                </div>
                
            </div>
            
            

            
            <div class="paper-rating">
                <span class="rating-label" style="color: #000; font-weight: bold;">Overall:</span>
                
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="far fa-star"></i>
                    
                
                <span class="text-xs text-gray-500 ml-1">(8/10)</span>
            </div>
            

            <a href="http://arxiv.org/abs/2505.02471v1" target="_blank" class="paper-link">
                <i class="fas fa-file-pdf mr-1"></i> Read Paper (PDF)
            </a>
            
            <p class="paper-authors">Authors: Biao Gong, Cheng Zou, Dandan Zheng, Hu Yu, Jingdong Chen, Jianxin Sun, Junbo Zhao, Jun Zhou, Kaixiang Ji, Lixiang Ru, Libin Wang, Qingpei Guo, Rui Liu, Weilong Chai, Xinyu Xiao, Ziyuan Huang</p>
            
        </motion.div>
        
        <motion.div
            initial="{ opacity: 0, y: 50, scale: 0.9 }"
            whileInView="{ opacity: 1, y: 0, scale: 1 }"
            viewport="{ once: true, amount: 0.2 }" /* Trigger when 20% is visible */
            transition="{ duration: 0.5, delay: 0.4, ease: 'easeOut' }"  
            class="bento-item"
            data-motion-element
        >
            <h2 class="paper-title">SuperEdit: Rectifying and Facilitating Supervision for Instruction-Based Image Editing</h2>
            <p class="paper-summary">Due to the challenges of manually collecting accurate editing data, existing
datasets are typically constructed using various automated methods, leading to
noisy supervision signals caused by the mismatch between editing instructions
and original-edited image pairs. Recent efforts attempt to improve editing
models through generating higher-quality edited images, pre-training on
recognition tasks, or introducing vision-language models (VLMs) but fail to
resolve this fundamental issue. In this paper, we offer a novel solution by
constructing more effective editing instructions for given image pairs. This
includes rectifying the editing instructions to better align with the
original-edited image pairs and using contrastive editing instructions to
further enhance their effectiveness. Specifically, we find that editing models
exhibit specific generation attributes at different inference steps,
independent of the text. Based on these prior attributes, we define a unified
guide for VLMs to rectify editing instructions. However, there are some
challenging editing scenarios that cannot be resolved solely with rectified
instructions. To this end, we further construct contrastive supervision signals
with positive and negative instructions and introduce them into the model
training using triplet loss, thereby further facilitating supervision
effectiveness. Our method does not require the VLM modules or pre-training
tasks used in previous work, offering a more direct and efficient way to
provide better supervision signals, and providing a novel, simple, and
effective solution for instruction-based image editing. Results on multiple
benchmarks demonstrate that our method significantly outperforms existing
approaches. Compared with previous SOTA SmartEdit, we achieve 9.19%
improvements on the Real-Edit benchmark with 30x less training data and 13x
smaller model size.</p>
            
            <p class="paper-tldr"><strong>TLDR</strong>: the paper introduces superedit, a method for improving instruction-based image editing by rectifying editing instructions using vlm guidance based on observed generation attributes and incorporating contrastive supervision with triplet loss, achieving superior performance with less data and smaller model size compared to sota methods.</p>
            
            
            <p class="paper-tldr"><strong>TLDR</strong>: 该论文介绍了superedit，一种通过修正编辑指令来改进基于指令的图像编辑的方法。该方法利用视觉语言模型（vlm）的指导，根据观察到的生成属性来修正编辑指令，并结合三元组损失进行对比监督。相比于现有最佳方法，superedit在更少的数据和更小的模型尺寸下实现了更卓越的性能。</p>
            

            
            
            <div class="paper-sub-ratings" style="display: flex; flex-wrap: wrap; gap: 10px; margin-bottom: 5px; font-size: 0.8em;">
                
                <div class="rating-item">
                    <span class="rating-label">Relevance:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="far fa-star"></i>
                    <span class="text-xs text-gray-500 ml-1">(8/10)</span>
                </div>
                
                
                <div class="rating-item">
                    <span class="rating-label">Novelty:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="far fa-star"></i>
                    <span class="text-xs text-gray-500 ml-1">(8/10)</span>
                </div>
                
                
                <div class="rating-item">
                    <span class="rating-label">Clarity:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star-half-alt"></i>
                    <span class="text-xs text-gray-500 ml-1">(9/10)</span>
                </div>
                
                
                <div class="rating-item">
                    <span class="rating-label">Potential Impact:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="far fa-star"></i>
                    <span class="text-xs text-gray-500 ml-1">(8/10)</span>
                </div>
                
            </div>
            
            

            
            <div class="paper-rating">
                <span class="rating-label" style="color: #000; font-weight: bold;">Overall:</span>
                
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="far fa-star"></i>
                    
                
                <span class="text-xs text-gray-500 ml-1">(8/10)</span>
            </div>
            

            <a href="http://arxiv.org/abs/2505.02370v1" target="_blank" class="paper-link">
                <i class="fas fa-file-pdf mr-1"></i> Read Paper (PDF)
            </a>
            
            <p class="paper-authors">Authors: Ming Li, Xin Gu, Fan Chen, Xiaoying Xing, Longyin Wen, Chen Chen, Sijie Zhu</p>
            
        </motion.div>
        
        <motion.div
            initial="{ opacity: 0, y: 50, scale: 0.9 }"
            whileInView="{ opacity: 1, y: 0, scale: 1 }"
            viewport="{ once: true, amount: 0.2 }" /* Trigger when 20% is visible */
            transition="{ duration: 0.5, delay: 0.45, ease: 'easeOut' }"  
            class="bento-item"
            data-motion-element
        >
            <h2 class="paper-title">Enhancing AI Face Realism: Cost-Efficient Quality Improvement in Distilled Diffusion Models with a Fully Synthetic Dataset</h2>
            <p class="paper-summary">This study presents a novel approach to enhance the cost-to-quality ratio of
image generation with diffusion models. We hypothesize that differences between
distilled (e.g. FLUX.1-schnell) and baseline (e.g. FLUX.1-dev) models are
consistent and, therefore, learnable within a specialized domain, like portrait
generation. We generate a synthetic paired dataset and train a fast
image-to-image translation head. Using two sets of low- and high-quality
synthetic images, our model is trained to refine the output of a distilled
generator (e.g., FLUX.1-schnell) to a level comparable to a baseline model like
FLUX.1-dev, which is more computationally intensive. Our results show that the
pipeline, which combines a distilled version of a large generative model with
our enhancement layer, delivers similar photorealistic portraits to the
baseline version with up to an 82% decrease in computational cost compared to
FLUX.1-dev. This study demonstrates the potential for improving the efficiency
of AI solutions involving large-scale image generation.</p>
            
            <p class="paper-tldr"><strong>TLDR</strong>: this paper introduces a cost-effective method to improve the quality of images generated by distilled diffusion models by training an image-to-image translation head on a synthetic dataset, achieving comparable quality to baseline models with significantly reduced computational cost.</p>
            
            
            <p class="paper-tldr"><strong>TLDR</strong>: 本文提出了一种经济高效的方法，通过在合成数据集上训练图像到图像的翻译头，来提高蒸馏扩散模型生成的图像质量，从而以显著降低的计算成本实现与基线模型相当的质量。</p>
            

            
            
            <div class="paper-sub-ratings" style="display: flex; flex-wrap: wrap; gap: 10px; margin-bottom: 5px; font-size: 0.8em;">
                
                <div class="rating-item">
                    <span class="rating-label">Relevance:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="far fa-star"></i>
                    <span class="text-xs text-gray-500 ml-1">(8/10)</span>
                </div>
                
                
                <div class="rating-item">
                    <span class="rating-label">Novelty:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star-half-alt"></i><i class="far fa-star"></i>
                    <span class="text-xs text-gray-500 ml-1">(7/10)</span>
                </div>
                
                
                <div class="rating-item">
                    <span class="rating-label">Clarity:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star-half-alt"></i>
                    <span class="text-xs text-gray-500 ml-1">(9/10)</span>
                </div>
                
                
                <div class="rating-item">
                    <span class="rating-label">Potential Impact:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="far fa-star"></i>
                    <span class="text-xs text-gray-500 ml-1">(8/10)</span>
                </div>
                
            </div>
            
            

            
            <div class="paper-rating">
                <span class="rating-label" style="color: #000; font-weight: bold;">Overall:</span>
                
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="far fa-star"></i>
                    
                
                <span class="text-xs text-gray-500 ml-1">(8/10)</span>
            </div>
            

            <a href="http://arxiv.org/abs/2505.02255v1" target="_blank" class="paper-link">
                <i class="fas fa-file-pdf mr-1"></i> Read Paper (PDF)
            </a>
            
            <p class="paper-authors">Authors: Jakub Wąsala, Bartłomiej Wrzalski, Kornelia Noculak, Yuliia Tarasenko, Oliwer Krupa, Jan Kocoń, Grzegorz Chodak</p>
            
        </motion.div>
        
        <motion.div
            initial="{ opacity: 0, y: 50, scale: 0.9 }"
            whileInView="{ opacity: 1, y: 0, scale: 1 }"
            viewport="{ once: true, amount: 0.2 }" /* Trigger when 20% is visible */
            transition="{ duration: 0.5, delay: 0.5, ease: 'easeOut' }"  
            class="bento-item"
            data-motion-element
        >
            <h2 class="paper-title">Quantizing Diffusion Models from a Sampling-Aware Perspective</h2>
            <p class="paper-summary">Diffusion models have recently emerged as the dominant approach in visual
generation tasks. However, the lengthy denoising chains and the computationally
intensive noise estimation networks hinder their applicability in low-latency
and resource-limited environments. Previous research has endeavored to address
these limitations in a decoupled manner, utilizing either advanced samplers or
efficient model quantization techniques. In this study, we uncover that
quantization-induced noise disrupts directional estimation at each sampling
step, further distorting the precise directional estimations of higher-order
samplers when solving the sampling equations through discretized numerical
methods, thereby altering the optimal sampling trajectory. To attain dual
acceleration with high fidelity, we propose a sampling-aware quantization
strategy, wherein a Mixed-Order Trajectory Alignment technique is devised to
impose a more stringent constraint on the error bounds at each sampling step,
facilitating a more linear probability flow. Extensive experiments on
sparse-step fast sampling across multiple datasets demonstrate that our
approach preserves the rapid convergence characteristics of high-speed samplers
while maintaining superior generation quality. Code will be made publicly
available soon.</p>
            
            <p class="paper-tldr"><strong>TLDR</strong>: this paper proposes a sampling-aware quantization strategy, specifically mixed-order trajectory alignment, for diffusion models to improve both speed and generation quality in resource-constrained environments by mitigating quantization-induced noise during sampling.</p>
            
            
            <p class="paper-tldr"><strong>TLDR</strong>: 本文提出了一种采样感知量化策略，即混合阶轨迹对齐，用于扩散模型，通过减轻采样过程中量化引起的噪声，从而在资源受限的环境中提高速度和生成质量。</p>
            

            
            
            <div class="paper-sub-ratings" style="display: flex; flex-wrap: wrap; gap: 10px; margin-bottom: 5px; font-size: 0.8em;">
                
                <div class="rating-item">
                    <span class="rating-label">Relevance:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="far fa-star"></i>
                    <span class="text-xs text-gray-500 ml-1">(8/10)</span>
                </div>
                
                
                <div class="rating-item">
                    <span class="rating-label">Novelty:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star-half-alt"></i><i class="far fa-star"></i>
                    <span class="text-xs text-gray-500 ml-1">(7/10)</span>
                </div>
                
                
                <div class="rating-item">
                    <span class="rating-label">Clarity:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star-half-alt"></i>
                    <span class="text-xs text-gray-500 ml-1">(9/10)</span>
                </div>
                
                
                <div class="rating-item">
                    <span class="rating-label">Potential Impact:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="far fa-star"></i>
                    <span class="text-xs text-gray-500 ml-1">(8/10)</span>
                </div>
                
            </div>
            
            

            
            <div class="paper-rating">
                <span class="rating-label" style="color: #000; font-weight: bold;">Overall:</span>
                
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="far fa-star"></i>
                    
                
                <span class="text-xs text-gray-500 ml-1">(8/10)</span>
            </div>
            

            <a href="http://arxiv.org/abs/2505.02242v1" target="_blank" class="paper-link">
                <i class="fas fa-file-pdf mr-1"></i> Read Paper (PDF)
            </a>
            
            <p class="paper-authors">Authors: Qian Zeng, Jie Song, Yuanyu Wan, Huiqiong Wang, Mingli Song</p>
            
        </motion.div>
        
        <motion.div
            initial="{ opacity: 0, y: 50, scale: 0.9 }"
            whileInView="{ opacity: 1, y: 0, scale: 1 }"
            viewport="{ once: true, amount: 0.2 }" /* Trigger when 20% is visible */
            transition="{ duration: 0.5, delay: 0.55, ease: 'easeOut' }"  
            class="bento-item"
            data-motion-element
        >
            <h2 class="paper-title">Improving Physical Object State Representation in Text-to-Image Generative Systems</h2>
            <p class="paper-summary">Current text-to-image generative models struggle to accurately represent
object states (e.g., "a table without a bottle," "an empty tumbler"). In this
work, we first design a fully-automatic pipeline to generate high-quality
synthetic data that accurately captures objects in varied states. Next, we
fine-tune several open-source text-to-image models on this synthetic data. We
evaluate the performance of the fine-tuned models by quantifying the alignment
of the generated images to their prompts using GPT4o-mini, and achieve an
average absolute improvement of 8+% across four models on the public
GenAI-Bench dataset. We also curate a collection of 200 prompts with a specific
focus on common objects in various physical states. We demonstrate a
significant improvement of an average of 24+% over the baseline on this
dataset. We release all evaluation prompts and code.</p>
            
            <p class="paper-tldr"><strong>TLDR</strong>: the paper addresses the problem of text-to-image models struggling to represent object states by generating synthetic data and fine-tuning existing models, resulting in significant improvements in image-prompt alignment.</p>
            
            
            <p class="paper-tldr"><strong>TLDR</strong>: 该论文通过生成合成数据并微调现有模型，解决了文本到图像模型难以表示对象状态的问题，从而显著提高了图像与提示的对齐程度。</p>
            

            
            
            <div class="paper-sub-ratings" style="display: flex; flex-wrap: wrap; gap: 10px; margin-bottom: 5px; font-size: 0.8em;">
                
                <div class="rating-item">
                    <span class="rating-label">Relevance:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star-half-alt"></i>
                    <span class="text-xs text-gray-500 ml-1">(9/10)</span>
                </div>
                
                
                <div class="rating-item">
                    <span class="rating-label">Novelty:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star-half-alt"></i><i class="far fa-star"></i>
                    <span class="text-xs text-gray-500 ml-1">(7/10)</span>
                </div>
                
                
                <div class="rating-item">
                    <span class="rating-label">Clarity:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star-half-alt"></i>
                    <span class="text-xs text-gray-500 ml-1">(9/10)</span>
                </div>
                
                
                <div class="rating-item">
                    <span class="rating-label">Potential Impact:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="far fa-star"></i>
                    <span class="text-xs text-gray-500 ml-1">(8/10)</span>
                </div>
                
            </div>
            
            

            
            <div class="paper-rating">
                <span class="rating-label" style="color: #000; font-weight: bold;">Overall:</span>
                
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="far fa-star"></i>
                    
                
                <span class="text-xs text-gray-500 ml-1">(8/10)</span>
            </div>
            

            <a href="http://arxiv.org/abs/2505.02236v1" target="_blank" class="paper-link">
                <i class="fas fa-file-pdf mr-1"></i> Read Paper (PDF)
            </a>
            
            <p class="paper-authors">Authors: Tianle Chen, Chaitanya Chakka, Deepti Ghadiyaram</p>
            
        </motion.div>
        
        <motion.div
            initial="{ opacity: 0, y: 50, scale: 0.9 }"
            whileInView="{ opacity: 1, y: 0, scale: 1 }"
            viewport="{ once: true, amount: 0.2 }" /* Trigger when 20% is visible */
            transition="{ duration: 0.5, delay: 0.6000000000000001, ease: 'easeOut' }"  
            class="bento-item"
            data-motion-element
        >
            <h2 class="paper-title">R1-Reward: Training Multimodal Reward Model Through Stable Reinforcement Learning</h2>
            <p class="paper-summary">Multimodal Reward Models (MRMs) play a crucial role in enhancing the
performance of Multimodal Large Language Models (MLLMs). While recent
advancements have primarily focused on improving the model structure and
training data of MRMs, there has been limited exploration into the
effectiveness of long-term reasoning capabilities for reward modeling and how
to activate these capabilities in MRMs. In this paper, we explore how
Reinforcement Learning (RL) can be used to improve reward modeling.
Specifically, we reformulate the reward modeling problem as a rule-based RL
task. However, we observe that directly applying existing RL algorithms, such
as Reinforce++, to reward modeling often leads to training instability or even
collapse due to the inherent limitations of these algorithms. To address this
issue, we propose the StableReinforce algorithm, which refines the training
loss, advantage estimation strategy, and reward design of existing RL methods.
These refinements result in more stable training dynamics and superior
performance. To facilitate MRM training, we collect 200K preference data from
diverse datasets. Our reward model, R1-Reward, trained using the
StableReinforce algorithm on this dataset, significantly improves performance
on multimodal reward modeling benchmarks. Compared to previous SOTA models,
R1-Reward achieves a $8.4\%$ improvement on the VL Reward-Bench and a $14.3\%$
improvement on the Multimodal Reward Bench. Moreover, with more inference
compute, R1-Reward's performance is further enhanced, highlighting the
potential of RL algorithms in optimizing MRMs.</p>
            
            <p class="paper-tldr"><strong>TLDR</strong>: the paper introduces r1-reward, a multimodal reward model trained using a novel stablereinforce algorithm, demonstrating significant performance improvements on multimodal reward modeling benchmarks.</p>
            
            
            <p class="paper-tldr"><strong>TLDR</strong>: 该论文介绍了r1-reward，一个使用新型stablereinforce算法训练的多模态奖励模型，并在多模态奖励建模基准上展示了显著的性能提升。</p>
            

            
            
            <div class="paper-sub-ratings" style="display: flex; flex-wrap: wrap; gap: 10px; margin-bottom: 5px; font-size: 0.8em;">
                
                <div class="rating-item">
                    <span class="rating-label">Relevance:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="far fa-star"></i><i class="far fa-star"></i>
                    <span class="text-xs text-gray-500 ml-1">(6/10)</span>
                </div>
                
                
                <div class="rating-item">
                    <span class="rating-label">Novelty:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="far fa-star"></i>
                    <span class="text-xs text-gray-500 ml-1">(8/10)</span>
                </div>
                
                
                <div class="rating-item">
                    <span class="rating-label">Clarity:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star-half-alt"></i>
                    <span class="text-xs text-gray-500 ml-1">(9/10)</span>
                </div>
                
                
                <div class="rating-item">
                    <span class="rating-label">Potential Impact:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star-half-alt"></i><i class="far fa-star"></i>
                    <span class="text-xs text-gray-500 ml-1">(7/10)</span>
                </div>
                
            </div>
            
            

            
            <div class="paper-rating">
                <span class="rating-label" style="color: #000; font-weight: bold;">Overall:</span>
                
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="fas fa-star-half-alt"></i>
                    
                
                    
                        <i class="far fa-star"></i>
                    
                
                <span class="text-xs text-gray-500 ml-1">(7/10)</span>
            </div>
            

            <a href="http://arxiv.org/abs/2505.02835v1" target="_blank" class="paper-link">
                <i class="fas fa-file-pdf mr-1"></i> Read Paper (PDF)
            </a>
            
            <p class="paper-authors">Authors: Yi-Fan Zhang, Xingyu Lu, Xiao Hu, Chaoyou Fu, Bin Wen, Tianke Zhang, Changyi Liu, Kaiyu Jiang, Kaibing Chen, Kaiyu Tang, Haojie Ding, Jiankang Chen, Fan Yang, Zhang Zhang, Tingting Gao, Liang Wang</p>
            
        </motion.div>
        
        <motion.div
            initial="{ opacity: 0, y: 50, scale: 0.9 }"
            whileInView="{ opacity: 1, y: 0, scale: 1 }"
            viewport="{ once: true, amount: 0.2 }" /* Trigger when 20% is visible */
            transition="{ duration: 0.5, delay: 0.65, ease: 'easeOut' }"  
            class="bento-item"
            data-motion-element
        >
            <h2 class="paper-title">Sim2Real in endoscopy segmentation with a novel structure aware image translation</h2>
            <p class="paper-summary">Automatic segmentation of anatomical landmarks in endoscopic images can
provide assistance to doctors and surgeons for diagnosis, treatments or medical
training. However, obtaining the annotations required to train commonly used
supervised learning methods is a tedious and difficult task, in particular for
real images. While ground truth annotations are easier to obtain for synthetic
data, models trained on such data often do not generalize well to real data.
Generative approaches can add realistic texture to it, but face difficulties to
maintain the structure of the original scene. The main contribution in this
work is a novel image translation model that adds realistic texture to
simulated endoscopic images while keeping the key scene layout information. Our
approach produces realistic images in different endoscopy scenarios. We
demonstrate these images can effectively be used to successfully train a model
for a challenging end task without any real labeled data. In particular, we
demonstrate our approach for the task of fold segmentation in colonoscopy
images. Folds are key anatomical landmarks that can occlude parts of the colon
mucosa and possible polyps. Our approach generates realistic images maintaining
the shape and location of the original folds, after the
image-style-translation, better than existing methods. We run experiments both
on a novel simulated dataset for fold segmentation, and real data from the
EndoMapper (EM) dataset. All our new generated data and new EM metadata is
being released to facilitate further research, as no public benchmark is
currently available for the task of fold segmentation.</p>
            
            <p class="paper-tldr"><strong>TLDR</strong>: this paper presents a novel image translation model for sim2real endoscopy segmentation, adding realistic texture to simulated images while preserving structural information, which allows training a segmentation model without real labeled data. the authors also release a new simulated dataset as well as new metadata for the endomapper dataset.</p>
            
            
            <p class="paper-tldr"><strong>TLDR</strong>: 本文提出了一种新的图像转换模型，用于内窥镜分割中的sim2real，在为模拟图像添加真实纹理的同时保留结构信息，从而无需真实标记数据即可训练分割模型。作者还发布了一个新的模拟数据集以及endomapper数据集的新元数据。</p>
            

            
            
            <div class="paper-sub-ratings" style="display: flex; flex-wrap: wrap; gap: 10px; margin-bottom: 5px; font-size: 0.8em;">
                
                <div class="rating-item">
                    <span class="rating-label">Relevance:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star-half-alt"></i><i class="far fa-star"></i>
                    <span class="text-xs text-gray-500 ml-1">(7/10)</span>
                </div>
                
                
                <div class="rating-item">
                    <span class="rating-label">Novelty:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="far fa-star"></i>
                    <span class="text-xs text-gray-500 ml-1">(8/10)</span>
                </div>
                
                
                <div class="rating-item">
                    <span class="rating-label">Clarity:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star-half-alt"></i>
                    <span class="text-xs text-gray-500 ml-1">(9/10)</span>
                </div>
                
                
                <div class="rating-item">
                    <span class="rating-label">Potential Impact:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star-half-alt"></i><i class="far fa-star"></i>
                    <span class="text-xs text-gray-500 ml-1">(7/10)</span>
                </div>
                
            </div>
            
            

            
            <div class="paper-rating">
                <span class="rating-label" style="color: #000; font-weight: bold;">Overall:</span>
                
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="fas fa-star-half-alt"></i>
                    
                
                    
                        <i class="far fa-star"></i>
                    
                
                <span class="text-xs text-gray-500 ml-1">(7/10)</span>
            </div>
            

            <a href="http://arxiv.org/abs/2505.02654v1" target="_blank" class="paper-link">
                <i class="fas fa-file-pdf mr-1"></i> Read Paper (PDF)
            </a>
            
            <p class="paper-authors">Authors: Clara Tomasini, Luis Riazuelo, Ana C. Murillo</p>
            
        </motion.div>
        
        <motion.div
            initial="{ opacity: 0, y: 50, scale: 0.9 }"
            whileInView="{ opacity: 1, y: 0, scale: 1 }"
            viewport="{ once: true, amount: 0.2 }" /* Trigger when 20% is visible */
            transition="{ duration: 0.5, delay: 0.7000000000000001, ease: 'easeOut' }"  
            class="bento-item"
            data-motion-element
        >
            <h2 class="paper-title">Detect, Classify, Act: Categorizing Industrial Anomalies with Multi-Modal Large Language Models</h2>
            <p class="paper-summary">Recent advances in visual industrial anomaly detection have demonstrated
exceptional performance in identifying and segmenting anomalous regions while
maintaining fast inference speeds. However, anomaly
classification-distinguishing different types of anomalies-remains largely
unexplored despite its critical importance in real-world inspection tasks. To
address this gap, we propose VELM, a novel LLM-based pipeline for anomaly
classification. Given the critical importance of inference speed, we first
apply an unsupervised anomaly detection method as a vision expert to assess the
normality of an observation. If an anomaly is detected, the LLM then classifies
its type. A key challenge in developing and evaluating anomaly classification
models is the lack of precise annotations of anomaly classes in existing
datasets. To address this limitation, we introduce MVTec-AC and VisA-AC,
refined versions of the widely used MVTec-AD and VisA datasets, which include
accurate anomaly class labels for rigorous evaluation. Our approach achieves a
state-of-the-art anomaly classification accuracy of 80.4% on MVTec-AD,
exceeding the prior baselines by 5%, and 84% on MVTec-AC, demonstrating the
effectiveness of VELM in understanding and categorizing anomalies. We hope our
methodology and benchmark inspire further research in anomaly classification,
helping bridge the gap between detection and comprehensive anomaly
characterization.</p>
            
            <p class="paper-tldr"><strong>TLDR</strong>: the paper introduces velm, a multi-modal llm-based pipeline for anomaly classification in industrial settings, and provides two new datasets, mvtec-ac and visa-ac, with accurate anomaly class labels for evaluation.</p>
            
            
            <p class="paper-tldr"><strong>TLDR</strong>: 本文介绍了一种基于多模态llm的异常分类管道velm，用于工业环境中的异常分类，并提供了两个新的数据集mvtec-ac和visa-ac，其中包含准确的异常类别标签，用于评估。</p>
            

            
            
            <div class="paper-sub-ratings" style="display: flex; flex-wrap: wrap; gap: 10px; margin-bottom: 5px; font-size: 0.8em;">
                
                <div class="rating-item">
                    <span class="rating-label">Relevance:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star-half-alt"></i><i class="far fa-star"></i><i class="far fa-star"></i><i class="far fa-star"></i>
                    <span class="text-xs text-gray-500 ml-1">(3/10)</span>
                </div>
                
                
                <div class="rating-item">
                    <span class="rating-label">Novelty:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star-half-alt"></i><i class="far fa-star"></i>
                    <span class="text-xs text-gray-500 ml-1">(7/10)</span>
                </div>
                
                
                <div class="rating-item">
                    <span class="rating-label">Clarity:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="far fa-star"></i>
                    <span class="text-xs text-gray-500 ml-1">(8/10)</span>
                </div>
                
                
                <div class="rating-item">
                    <span class="rating-label">Potential Impact:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="far fa-star"></i><i class="far fa-star"></i>
                    <span class="text-xs text-gray-500 ml-1">(6/10)</span>
                </div>
                
            </div>
            
            

            
            <div class="paper-rating">
                <span class="rating-label" style="color: #000; font-weight: bold;">Overall:</span>
                
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="fas fa-star-half-alt"></i>
                    
                
                    
                        <i class="far fa-star"></i>
                    
                
                    
                        <i class="far fa-star"></i>
                    
                
                <span class="text-xs text-gray-500 ml-1">(5/10)</span>
            </div>
            

            <a href="http://arxiv.org/abs/2505.02626v1" target="_blank" class="paper-link">
                <i class="fas fa-file-pdf mr-1"></i> Read Paper (PDF)
            </a>
            
            <p class="paper-authors">Authors: Sassan Mokhtar, Arian Mousakhan, Silvio Galesso, Jawad Tayyub, Thomas Brox</p>
            
        </motion.div>
        
        <motion.div
            initial="{ opacity: 0, y: 50, scale: 0.9 }"
            whileInView="{ opacity: 1, y: 0, scale: 1 }"
            viewport="{ once: true, amount: 0.2 }" /* Trigger when 20% is visible */
            transition="{ duration: 0.5, delay: 0.75, ease: 'easeOut' }"  
            class="bento-item"
            data-motion-element
        >
            <h2 class="paper-title">AOR: Anatomical Ontology-Guided Reasoning for Medical Large Multimodal Model in Chest X-Ray Interpretation</h2>
            <p class="paper-summary">Chest X-rays (CXRs) are the most frequently performed imaging examinations in
clinical settings. Recent advancements in Large Multimodal Models (LMMs) have
enabled automated CXR interpretation, enhancing diagnostic accuracy and
efficiency. However, despite their strong visual understanding, current Medical
LMMs (MLMMs) still face two major challenges: (1) Insufficient region-level
understanding and interaction, and (2) Limited accuracy and interpretability
due to single-step reasoning. In this paper, we empower MLMMs with
anatomy-centric reasoning capabilities to enhance their interactivity and
explainability. Specifically, we first propose an Anatomical Ontology-Guided
Reasoning (AOR) framework, which centers on cross-modal region-level
information to facilitate multi-step reasoning. Next, under the guidance of
expert physicians, we develop AOR-Instruction, a large instruction dataset for
MLMMs training. Our experiments demonstrate AOR's superior performance in both
VQA and report generation tasks.</p>
            
            <p class="paper-tldr"><strong>TLDR</strong>: the paper introduces an anatomical ontology-guided reasoning (aor) framework to improve medical large multimodal models' (mlmms) performance in chest x-ray interpretation by enhancing region-level understanding and multi-step reasoning.</p>
            
            
            <p class="paper-tldr"><strong>TLDR</strong>: 该论文介绍了一种解剖本体引导推理 (aor) 框架，通过增强区域级理解和多步推理，来提高医学大型多模态模型 (mlmm) 在胸部 x 光片解释方面的性能。</p>
            

            
            
            <div class="paper-sub-ratings" style="display: flex; flex-wrap: wrap; gap: 10px; margin-bottom: 5px; font-size: 0.8em;">
                
                <div class="rating-item">
                    <span class="rating-label">Relevance:</span>
                    
                    <i class="fas fa-star"></i><i class="far fa-star"></i><i class="far fa-star"></i><i class="far fa-star"></i><i class="far fa-star"></i>
                    <span class="text-xs text-gray-500 ml-1">(2/10)</span>
                </div>
                
                
                <div class="rating-item">
                    <span class="rating-label">Novelty:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star-half-alt"></i><i class="far fa-star"></i>
                    <span class="text-xs text-gray-500 ml-1">(7/10)</span>
                </div>
                
                
                <div class="rating-item">
                    <span class="rating-label">Clarity:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="far fa-star"></i>
                    <span class="text-xs text-gray-500 ml-1">(8/10)</span>
                </div>
                
                
                <div class="rating-item">
                    <span class="rating-label">Potential Impact:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star-half-alt"></i><i class="far fa-star"></i>
                    <span class="text-xs text-gray-500 ml-1">(7/10)</span>
                </div>
                
            </div>
            
            

            
            <div class="paper-rating">
                <span class="rating-label" style="color: #000; font-weight: bold;">Overall:</span>
                
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="far fa-star"></i>
                    
                
                    
                        <i class="far fa-star"></i>
                    
                
                    
                        <i class="far fa-star"></i>
                    
                
                <span class="text-xs text-gray-500 ml-1">(4/10)</span>
            </div>
            

            <a href="http://arxiv.org/abs/2505.02830v1" target="_blank" class="paper-link">
                <i class="fas fa-file-pdf mr-1"></i> Read Paper (PDF)
            </a>
            
            <p class="paper-authors">Authors: Qingqiu Li, Zihang Cui, Seongsu Bae, Jilan Xu, Runtian Yuan, Yuejie Zhang, Rui Feng, Quanli Shen, Xiaobo Zhang, Junjun He, Shujun Wang</p>
            
        </motion.div>
        
        <motion.div
            initial="{ opacity: 0, y: 50, scale: 0.9 }"
            whileInView="{ opacity: 1, y: 0, scale: 1 }"
            viewport="{ once: true, amount: 0.2 }" /* Trigger when 20% is visible */
            transition="{ duration: 0.5, delay: 0.8, ease: 'easeOut' }"  
            class="bento-item"
            data-motion-element
        >
            <h2 class="paper-title">CSASN: A Multitask Attention-Based Framework for Heterogeneous Thyroid Carcinoma Classification in Ultrasound Images</h2>
            <p class="paper-summary">Heterogeneous morphological features and data imbalance pose significant
challenges in rare thyroid carcinoma classification using ultrasound imaging.
To address this issue, we propose a novel multitask learning framework,
Channel-Spatial Attention Synergy Network (CSASN), which integrates a
dual-branch feature extractor - combining EfficientNet for local spatial
encoding and ViT for global semantic modeling, with a cascaded channel-spatial
attention refinement module. A residual multiscale classifier and dynamically
weighted loss function further enhance classification stability and accuracy.
Trained on a multicenter dataset comprising more than 2000 patients from four
clinical institutions, our framework leverages a residual multiscale classifier
and dynamically weighted loss function to enhance classification stability and
accuracy. Extensive ablation studies demonstrate that each module contributes
significantly to model performance, particularly in recognizing rare subtypes
such as FTC and MTC carcinomas. Experimental results show that CSASN
outperforms existing single-stream CNN or Transformer-based models, achieving a
superior balance between precision and recall under class-imbalanced
conditions. This framework provides a promising strategy for AI-assisted
thyroid cancer diagnosis.</p>
            
            <p class="paper-tldr"><strong>TLDR</strong>: the paper introduces csasn, a multitask attention-based framework for classifying heterogeneous thyroid carcinoma in ultrasound images, using a dual-branch feature extractor and a cascaded channel-spatial attention refinement module to address data imbalance and improve classification accuracy.</p>
            
            
            <p class="paper-tldr"><strong>TLDR</strong>: 该论文介绍了一种名为csasn的多任务注意力框架，用于对超声图像中异构性甲状腺癌进行分类。该框架采用双分支特征提取器和级联通道空间注意力细化模块，旨在解决数据不平衡问题并提高分类准确性。</p>
            

            
            
            <div class="paper-sub-ratings" style="display: flex; flex-wrap: wrap; gap: 10px; margin-bottom: 5px; font-size: 0.8em;">
                
                <div class="rating-item">
                    <span class="rating-label">Relevance:</span>
                    
                    <i class="fas fa-star"></i><i class="far fa-star"></i><i class="far fa-star"></i><i class="far fa-star"></i><i class="far fa-star"></i>
                    <span class="text-xs text-gray-500 ml-1">(2/10)</span>
                </div>
                
                
                <div class="rating-item">
                    <span class="rating-label">Novelty:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star-half-alt"></i><i class="far fa-star"></i>
                    <span class="text-xs text-gray-500 ml-1">(7/10)</span>
                </div>
                
                
                <div class="rating-item">
                    <span class="rating-label">Clarity:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="far fa-star"></i>
                    <span class="text-xs text-gray-500 ml-1">(8/10)</span>
                </div>
                
                
                <div class="rating-item">
                    <span class="rating-label">Potential Impact:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="far fa-star"></i><i class="far fa-star"></i>
                    <span class="text-xs text-gray-500 ml-1">(6/10)</span>
                </div>
                
            </div>
            
            

            
            <div class="paper-rating">
                <span class="rating-label" style="color: #000; font-weight: bold;">Overall:</span>
                
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="far fa-star"></i>
                    
                
                    
                        <i class="far fa-star"></i>
                    
                
                    
                        <i class="far fa-star"></i>
                    
                
                <span class="text-xs text-gray-500 ml-1">(4/10)</span>
            </div>
            

            <a href="http://arxiv.org/abs/2505.02211v1" target="_blank" class="paper-link">
                <i class="fas fa-file-pdf mr-1"></i> Read Paper (PDF)
            </a>
            
            <p class="paper-authors">Authors: Peiqi Li, Yincheng Gao, Renxing Li, Haojie Yang, Yunyun Liu, Boji Liu, Jiahui Ni, Ying Zhang, Yulu Wu, Xiaowei Fang, Lehang Guo, Liping Sun, Jiangang Chen</p>
            
        </motion.div>
        
    </div>

    <footer class="footer">
        Generated on 2025-05-08 04:31:07 UTC. Powered by <a href="https://github.com/onion-liu" target="_blank">onion-liu</a>.
    </footer>

</body>
</html>